Machine Learning Allowed Interpreting Toxicity of a Fe-Doped CuO NM Library Large Data Set—An Environmental In Vivo Case Study

The wide variation in nanomaterial (NM) characteristics (size, shape, and properties) and in the related impacts on living organisms makes it virtually impossible to assess their safety; modeling has long been called for. We here investigate a custom-designed library of 1–10% Fe-doped CuO NMs. Effects were assessed using the soil ecotoxicology model Enchytraeus crypticus (Oligochaeta) in the standard 21-day test and its 49-day extension. Results showed that 10%Fe-CuO was the most toxic (21 days reproduction EC50 = 650 mg NM/kg soil) and Fe3O4 NM was the least toxic (no effects up to 3200 mg NM/kg soil). All other NMs caused similar effects on E. crypticus (21 days reproduction EC50 ranging from 875 to 1923 mg NM/kg soil, with overlapping confidence intervals). To identify the key NM characteristics responsible for the toxicity, machine learning (ML) modeling was used to analyze the large data set [9 NMs, 68 descriptors, 6 concentrations, 2 exposure times (21 and 49 days), 2 endpoints (survival and reproduction)]. ML allowed us to separate experiment-related parameters (e.g., zeta potential) from particle-specific descriptors (e.g., force vectors) and so to better identify the important descriptors. Concentration-dependent descriptors (environmental parameters, e.g., zeta potential) were the most important under the standard test duration (21 days). Under the longer exposure (49 days), which more closely represents real-world conditions, the particle-specific descriptors were more important than the concentration-dependent parameters. The longer-term exposure also showed that the steepness of the concentration–response curve decreased with increasing Fe content in the NMs. Longer-term exposure should be a requirement in the hazard assessment of NMs, in addition to the standard test in the OECD guidelines for chemicals.
Progress toward ML analysis is desirable, given that it requires large data sets such as this one and has significant power to link NM descriptors to effects in animals. This goes beyond what current univariate and concentration–response modeling can achieve.


Figure S1. The Pareto front represents a list of suitable fitting functions f identified by the symbolic regressor, showing the trade-off between two key metrics: (a) the increase in R², indicating higher accuracy with more complex equations, and (b) the decrease in Mean Absolute Error (MAE). The most complex fitting equation tends to be the most accurate, while the elbow of the Pareto front signifies the best balance between fitting accuracy and equation complexity. To quantify the complexity of equations, the Eureqa symbolic regressor assigns default scores to formula building blocks: 1 for constant, addition, subtraction, and multiplication; 2 for division; and 4 for exponential, natural logarithm, and square root functions. The results depicted in this figure refer to one repetition of the 1st pruning round of concentration-independent variables.
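The scoring rule and the Pareto selection described in this caption can be illustrated with a minimal Python sketch. It is not the Eureqa implementation: the score of 1 for an input variable is an assumption (the caption does not state it), the function names are hypothetical, and the candidate (complexity, MAE) pairs are invented for illustration only.

```python
import ast

# Building-block scores quoted in the caption.
OPERATOR_SCORES = {ast.Add: 1, ast.Sub: 1, ast.Mult: 1, ast.Div: 2}
FUNCTION_SCORES = {"exp": 4, "log": 4, "sqrt": 4}
CONSTANT_SCORE = 1
VARIABLE_SCORE = 1  # assumption: not stated in the caption


def complexity(expression):
    """Sum the building-block scores over the parsed expression tree."""
    def score(node):
        if isinstance(node, ast.Expression):
            return score(node.body)
        if isinstance(node, ast.BinOp):
            return (OPERATOR_SCORES[type(node.op)]
                    + score(node.left) + score(node.right))
        if isinstance(node, ast.Call):
            return FUNCTION_SCORES[node.func.id] + sum(score(a) for a in node.args)
        if isinstance(node, ast.Constant):
            return CONSTANT_SCORE
        if isinstance(node, ast.Name):
            return VARIABLE_SCORE
        raise ValueError(f"unsupported building block: {ast.dump(node)}")
    return score(ast.parse(expression, mode="eval"))


def pareto_front(candidates):
    """Keep the (complexity, error) pairs that no other pair beats on both metrics."""
    front = []
    for c, e in candidates:
        dominated = any(
            c2 <= c and e2 <= e and (c2 < c or e2 < e)
            for c2, e2 in candidates
        )
        if not dominated:
            front.append((c, e))
    return sorted(front)


# Hypothetical candidates: (complexity score, MAE).
candidates = [(3, 0.40), (5, 0.25), (6, 0.30), (12, 0.10), (12, 0.20)]
front = pareto_front(candidates)
print(complexity("2.5 * x1 + exp(x2 / x3)"))
print(front)
```

In this toy set, the pair (6, 0.30) is dropped because a simpler candidate already achieves a lower error, which is the sense in which the front lists only the useful trade-offs.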

Figure S2. Spearman's correlation coefficient computed between each pair of Fe-doped CuO particle variables potentially related to toxicity. In detail, the figure displays the 64 concentration-independent variables that remained after dataset cleaning. Whiter colour tones in the figure indicate no correlation between the variables, while blue tones indicate correlation. It is important to note that, as per the definition of Spearman's correlation coefficient, the matrix is symmetrical. See Table S8 for a

Figure S3. Spearman's correlation coefficient computed between each pair of concentration-independent variables within the 15 clusters identified by the hierarchical clustering algorithm (refer to Table S5). In the figure, whiter colour tones indicate less correlation between each pair of variables, while blue tones indicate higher correlation. The black colour represents the background of the figure. It is important to note that the Spearman's correlation coefficient cannot be computed within clusters consisting of only one variable (e.g., cluster #7).
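How correlated variables end up in the same cluster, and why a single-variable cluster has no within-cluster coefficient, can be shown with a small sketch. The caption does not state the linkage criterion or the distance, so the version below assumes average linkage on the distance 1 − |ρ|. The function name, variable names, and correlation values are all hypothetical.

```python
def cluster_variables(corr, n_clusters):
    """Agglomerative clustering of variables from a correlation matrix.

    Assumptions (not from the source): average linkage, distance = 1 - |rho|.
    `corr` is a dict of dicts holding a symmetric correlation matrix.
    """
    clusters = [[name] for name in corr]

    def distance(a, b):
        # Average pairwise distance between the members of two clusters.
        pairs = [(u, v) for u in a for v in b]
        return sum(1 - abs(corr[u][v]) for u, v in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        # Merge the two closest clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical variables and correlations, for illustration only.
names = ["size", "area", "gap", "charge", "shape"]
strong = {("size", "area"): 0.9, ("gap", "charge"): -0.8}
corr = {u: {v: 1.0 if u == v else 0.1 for v in names} for u in names}
for (u, v), rho in strong.items():
    corr[u][v] = corr[v][u] = rho

clusters = cluster_variables(corr, n_clusters=3)
print(clusters)
```

Here "shape" stays alone in its cluster, so there is no pair of variables within it on which a correlation coefficient could be computed.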

Figure S4. Results of variable pruning. (a) Normalized occurrences of concentration-independent variables x_i in the fitting functions f identified by the symbolic regressor for the concentration-independent endpoint b. The definitions of the reported variables x_1, …, x_15 are reported in the Table

Figure S5. Spearman's correlation coefficient computed between each pair of Fe-doped CuO particle descriptors related to toxicity. In detail, the figure displays the 3 concentration-independent descriptors that remained after the pruning process and the 3 concentration-dependent descriptors, as detailed in Table S7. Whiter colour tones in the figure indicate no correlation between the variables, while blue tones indicate correlation. It is important to note that, as per the definition of Spearman's correlation coefficient, the matrix is symmetrical. Results show that the reported descriptors are not correlated with one another.
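For readers who want to reproduce this kind of matrix, the following is a minimal pure-Python sketch of Spearman's coefficient, computed as Pearson's correlation on average ranks. The function names are hypothetical, and the descriptor names and values are invented for illustration; they are not data from this study.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_matrix(columns):
    """Pairwise Spearman coefficients for a dict of named columns."""
    return {a: {b: spearman(columns[a], columns[b]) for b in columns}
            for a in columns}


# Hypothetical descriptor values for five particles.
columns = {
    "size": [10, 20, 30, 40, 50],
    "zeta": [-5, -9, -14, -20, -31],
    "fe_content": [1, 10, 4, 2, 6],
}
matrix = spearman_matrix(columns)
print(matrix["size"]["zeta"], matrix["size"]["fe_content"])
```

Because the coefficient depends only on the pair of columns and not on their order, the resulting matrix is symmetric with ones on the diagonal.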

Figure S6. Comparison of model correlations between the biological response (y) and the identified descriptors (x_1, …, x_6, see Table S7 for details) for Fe-doped CuO particles: experimental observations vs. model predictions. The values are normalized using the min-max approach, and each dot represents one tested configuration. (a) Fitting performance of the most complex and accurate function for Fe-doped CuO particles after 21 days exposure. (b) Fitting performance of the best compromise between model complexity and accuracy (i.e., the elbow of the Pareto front) for Fe-doped CuO particles after 21 days exposure. (c) Fitting performance of the most complex and accurate function for Fe-doped CuO particles after 49 days exposure. (d) Fitting performance of the best compromise between model complexity and accuracy for Fe-doped CuO particles after 49 days exposure.
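The min-max normalization and the two fit metrics used throughout (R² and MAE) can be sketched as follows. This is an illustration under stated assumptions: the caption does not say whether observations and predictions are scaled jointly, so the sketch scales the predictions with the minimum and maximum of the observations; the function names are hypothetical and the response values are invented.

```python
def min_max(values, low=None, high=None):
    """Scale values to [0, 1]; `low` and `high` default to the data's own range."""
    low = min(values) if low is None else low
    high = max(values) if high is None else high
    if high == low:
        raise ValueError("a constant series cannot be min-max scaled")
    return [(v - low) / (high - low) for v in values]


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observations and predictions."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Hypothetical responses for five tested configurations.
observed = [0, 50, 100, 150, 200]
predicted = [10, 40, 110, 140, 200]

# Assumption: predictions share the scale defined by the observations.
obs_norm = min_max(observed)
pred_norm = min_max(predicted, low=min(observed), high=max(observed))

mae = mean_absolute_error(obs_norm, pred_norm)
r2 = r_squared(obs_norm, pred_norm)
print(mae, r2)
```

A perfect model would place every dot on the diagonal of an observed-versus-predicted plot, giving MAE = 0 and R² = 1.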